Project: Telco Customer Churn Prediction

Domain : Telecom

• Context:

A telecom company wants to use its historical customer data to predict customer behaviour and retain customers. The task is to analyse all relevant customer data and develop focused customer retention programs.

• Data Description:

Each row represents a customer; each column contains a customer attribute, described in the column metadata. The dataset includes information about:

• Customers who left within the last month – the column is called Churn

• Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies

• Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges

• Demographic info about customers – gender, age range, and if they have partners and dependents

Project Objective:

Build a model that will help identify the customers who have a higher probability of churning. This will help the company understand the pain points and patterns of customer churn, and sharpen its focus on customer retention strategies.

• Steps to the project: [ Total score: 60 points ]

  1. Import and warehouse data: [ Score: 5 points ] • Import all the given datasets. Explore shape and size. • Merge all datasets into one and explore the final shape and size.
  2. Data cleansing: [ Score: 10 points ] • Missing-value treatment • Convert categorical attributes to continuous using relevant functional knowledge • Drop attribute(s) if required using relevant functional knowledge • Automate all the above steps
  3. Data analysis & visualisation: [ Score: 10 points ] • Perform detailed statistical analysis on the data. • Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.
  4. Data pre-processing: [ Score: 5 points ] • Segregate predictor vs target attributes. • Check for target balancing and fix it if found imbalanced. • Perform a train-test split. • Check whether the train and test data have similar statistical characteristics to the original data.
  5. Model training, testing and tuning: [ Score: 25 points ] • Train and test all ensemble models taught in the learning module. • Suggestion: use the standard ensembles available; you can also design your own ensemble technique using weak classifiers. • Display the classification accuracies for train and test data. • Apply all possible tuning techniques to train the best model for the given data. • Suggestion: use all possible hyperparameter combinations to extract the best accuracies. • Display and compare all the models designed, with their train and test accuracies. • Select the final best trained model, along with detailed comments on why it was selected. • Pickle the selected model for future use.
  6. Conclusion and improvisation: [ Score: 5 points ] • Write your conclusion on the results. • Give detailed suggestions for improvements in the quality, quantity, variety, velocity, veracity, etc. of the data points collected by the telecom operator, to enable better data analysis in future.
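Step 5 above ends with pickling the selected model for future use. A minimal sketch of that last step, using a toy LogisticRegression as a stand-in for the tuned winner and a hypothetical file name `churn_model.pkl`:

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

# Tiny stand-in model; in the project this would be the tuned best model.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
best_model = LogisticRegression().fit(X, y)

# Serialise the selected model to disk for future use...
path = os.path.join(tempfile.gettempdir(), "churn_model.pkl")
with open(path, "wb") as f:
    pickle.dump(best_model, f)

# ...and load it back later to score new customers.
with open(path, "rb") as f:
    restored = pickle.load(f)
```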

Telecom Customer Churn Prediction

Introduction

Customer churn, or customer turnover, refers to a customer ceasing services with a company. Churn prediction is a class of problems that extends to many areas, such as employee attrition in a company, churn from a mobile subscription, etc. We are going to use the Telco data to predict churn. After loading the data, we will explore the attributes and the relationships between them before building our model.

Customer Attrition

Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of clients or customers.

Telephone service companies, Internet service providers, pay-TV companies, insurance firms, and alarm monitoring services often use customer attrition analysis and customer attrition rates as one of their key business metrics, because the cost of retaining an existing customer is far less than that of acquiring a new one. Companies in these sectors often have customer service branches which attempt to win back defecting clients, because recovered long-term customers can be worth much more to a company than newly recruited clients.

Companies usually make a distinction between voluntary churn and involuntary churn. Voluntary churn occurs due to a decision by the customer to switch to another company or service provider, while involuntary churn occurs due to circumstances such as a customer's relocation to a long-term care facility, death, or relocation to a distant location. In most applications, involuntary reasons for churn are excluded from the analytical models. Analysts tend to concentrate on voluntary churn, because it typically occurs due to factors of the company-customer relationship which companies control, such as how billing interactions are handled or how after-sales help is provided.

Predictive analytics uses churn prediction models that predict customer churn by assessing each customer's propensity to churn. Since these models generate a small, prioritized list of potential defectors, they are effective at focusing customer retention marketing programs on the subset of the customer base most vulnerable to churn.

The 7 V's of Big Data

1. Volume:

Observation:

1) The dataset has 7043 rows x 21 columns, a total of 147,903 entries

2) Out of the 21 columns, 17 have only 2-4 unique values, predominantly binary ones such as "Yes"/"No" and "Male"/"Female"

Recommendation:

1) Since the dataset is about telecom churn, additional features like the following, which contribute to churn, would have added more value:

A) Reasons for Churn

B) Customer Segmentation [Consumer, MSME, SME, Corporate, Key Account]

C) Geographic Location to show Marketing segmentation [High ARPU, Low ARPU]

D) Outstanding collections/receivables from the customer [0-30, 30-60, 60-90, >90 days]

E) Citing poor or no network coverage

2. Velocity:

Observation:

1) The dataset doesn't indicate the duration or the frequency of the data collection

2) Customer churn is the key feature in the dataset, but it is not evident how frequently the data is refreshed when a customer churns

Recommendation:

1) To improve velocity, the dataset should be refreshed whenever a customer churns

3. Variety:

Observation:

The dataset provides a variety of customer features from which we can interpret how each relates to churn, for example:

1) 90% of customers had a phone connection

2) 45% of customers had multiple phone lines

3) 70% of customers had a fibre optic connection

4) 78% of customers did not have online security

Recommendation:

To improve the consistency of the dataset, the following measures should be considered:

1) What led the customer to churn (reasons)

2) How many cumulative features led to churn

4. Variability:

Observation:

The dataset has 21 columns, of which one feature has data type int64 and a couple of features have data type float64

Recommendation:

To add more variability to the dataset, key features attributed to churn such as CSAT survey results and trouble-ticket history would provide the right signal

5. Veracity:

Observation:

In our dataset, the features MonthlyCharges, TotalCharges and tenure are float64 and int64 data types. The accuracy of this information helps to assess the revenue impact of churn

Recommendation:

To improve the accuracy of the data source, tenure can be segmented (e.g. 0-6 months, 6 months-1 year) and charges can be bucketed (e.g. 0-500, 500-1000)

6. Visualization:

Observation:

For the given dataset the volume isn't huge, so box plots, pair plots, histograms and pie charts are sufficient to represent it

Recommendation:

If the dataset were huge, we could look at using spectrograms, scatter plots, geo charts, etc.

7. Value:

Observation:

The information provided in the dataset is largely nice-to-have, and some essential information is missing

Recommendation:

To predict churn, other vital information like the following would have added more value and led to a data-driven prediction:

A) Reasons for Churn

B) Customer Segmentation [Consumer, MSME, SME, Corporate, Key Account]

C) Geographic Location to show Marketing segmentation [High ARPU, Low ARPU]

D) Outstanding collections/receivables from the customer [0-30, 30-60, 60-90, >90 days]

E) Citing poor or no network coverage

1. Data Overview

Understanding the Dataset

The dataset has 21 attributes; their definitions are given below:

Exploratory Data Analysis

In this section, we will first do an exploratory data analysis, exploring most attributes and checking their contribution to, or relationship with, customer churn. We will follow the steps below:

1. Listing Statistical Properties

Before computing statistics, we will take a look at the data types.

1.1. Data Overview

1.2. Delete customerID Column

Since the 'customerID' column does not provide any relevant information for predicting customer churn, we can delete it.
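A one-line pandas sketch of this step; the mini-frame below is an illustrative stand-in for the real dataset:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the Telco data.
df = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE"],
    "tenure": [1, 34],
    "Churn": ["No", "No"],
})

# customerID is a unique identifier with no predictive value, so drop it.
df = df.drop(columns=["customerID"])
```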

2. Data Manipulation

2.1. Checking for Null Values in the Dataset

As of now we don't see any null values. However, we will find a few in the TotalCharges column after casting it to float64.

It can also be noted that the Tenure column is 0 for these entries even though the MonthlyCharges column is not empty. Let's see if there are any other 0 values in the Tenure column.

There are no additional zero values in the Tenure column. Let's delete the rows with missing values in the TotalCharges column (these are also the rows with tenure 0).
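A sketch of the cast-and-drop step described above, assuming the blanks in TotalCharges are literal space strings as in the public Telco file:

```python
import pandas as pd

# Stand-in frame: TotalCharges arrives as object dtype, with blank
# strings for brand-new customers whose tenure is 0.
df = pd.DataFrame({
    "tenure": [0, 12, 0],
    "MonthlyCharges": [52.55, 20.25, 80.85],
    "TotalCharges": [" ", "243.0", " "],
})

# errors="coerce" turns the blank strings into NaN...
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# ...which we then drop, as done in the text.
df = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)
```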

3. Exploratory Data Analysis

Plot insights:

Senior citizens' churn rate is much higher than the non-senior churn rate.

The churn rate for month-to-month contracts is much higher than for other contract durations.

Moderately higher churn rate for customers without partners.

Much higher churn rate for customers without dependents.

The electronic check payment method shows a much higher churn rate than other payment methods.

Customers with fiber optic InternetService as part of their contract have a much higher churn rate.

3.1. Customer Attrition in Data

Inspecting the mean attributes of customers who churn

As we can see, customers who churn stay with the company for a shorter time on average and have higher monthly charges compared to those who do not churn. Their total charges are lower than those of customers who do not churn, a consequence of the shorter tenure.

From the observations above, it looks like:

Inspecting Churn by Gender

There are slightly more male customers than female customers, but both sexes seem to churn at about the same rate.

Churn by Contract Type

Most customers are on month-to-month contracts, and they churn more than customers who subscribe to one-year or two-year contracts.

Churn by Payment Method

Customers who pay by electronic check seem to churn more than customers who pay by mailed check, bank transfer or credit card. Mailed check, bank transfer and credit card customers seem to churn at about the same rate.

Churn by Montly Rate

Customers charged less than 40 a month seem to churn less; as the monthly rate increases, churn increases. The customers who churn the most pay between 70 and 100 a month.


Churn by Total Charges

Customers with a total balance of less than 1500 seem to churn more than customers with a higher balance.

Churn by Monthly Charges and Tenure

Senior citizens churn more than non-senior citizens.

Detecting Outliers

An outlier is a value that lies at an abnormally large distance from the other values in the dataset. It can be much smaller or much larger; basically, it does not show the same pattern as the other values. We will use the interquartile range (IQR) to detect outliers. The interquartile range is the range between the first quartile (Q1) and the third quartile (Q3). With this approach, any value greater than Q3 + 1.5 IQR or less than Q1 - 1.5 IQR is considered an outlier. We will check for outliers in the tenure column.

We are going to draw the boxplot for the tenure column and get the outlier list.

Based on the method used here to detect outliers, all values seems to be in the normal range. Therefore, our dataset does not have outliers.
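The IQR rule described above can be sketched as follows, on a small hypothetical tenure sample rather than the real column:

```python
import pandas as pd

# Hypothetical tenure values standing in for the real column.
tenure = pd.Series([1, 5, 12, 24, 36, 48, 60, 72])

q1, q3 = tenure.quantile(0.25), tenure.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Any value outside [lower, upper] is flagged as an outlier.
outliers = tenure[(tenure < lower) | (tenure > upper)]
```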

3.2. Variables Distribution in Customer Attrition

Several of the numerical attributes are highly correlated: TotalCharges correlates strongly with tenure and with MonthlyCharges (it is roughly their product), so we only need to keep a subset of them.

3.3. Customer Attrition in Tenure Groups

3.4. Monthly Charges and Total Charges by Tenure and Churn groups

3.5. Average Charges by Tenure Groups

3.6. Monthly Charges, Total Charges and Tenure in Customer Attrition


3.7. Variable Summary

Correlation Analysis for Churn with Remaining Features

3.8. Correlation Matrix

Inference: There is some correlation between 'phone service' and 'multiple lines' since those who don't have a phone service cannot have multiple lines. So, knowing that a particular customer is not subscribed to phone service we can infer that the customer doesn't have multiple lines. Similarly, there is also a correlation between 'internet service' and 'online security', 'online backup', 'device protection', 'streaming tv' and 'streaming movies'

3.10. Binary Variables Distribution in Customer Attrition(Radar Chart)

4. Data Pre-Processing

Data needs to be label-encoded before applying machine learning models. In this section, we will find the features that are most predictive for our model. Before proceeding to feature engineering, we are going to map all string booleans to numeric booleans.

For the other categorical features, we will use dummy (one-hot) encoding to transform them into binary columns. For each variable with n categories, we create n-1 dummy features: a dummy column is created for each unique value of the nominal feature and assigned 1 if the row has that value and 0 otherwise, with the first category dropped to avoid redundancy.
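A sketch of the mapping and n-1 dummy encoding described above, using pandas `get_dummies` with `drop_first=True` on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical frame: one string boolean, one multi-level categorical.
df = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# Map the string boolean to a numeric boolean.
df["Partner"] = df["Partner"].map({"Yes": 1, "No": 0})

# drop_first=True keeps n-1 dummies per n-level feature, as described above.
df = pd.get_dummies(df, columns=["Contract"], drop_first=True)
```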

5. Model Building

5.1. Baseline Model

Model Building

In this section, we are going to build our first model. We will try different machine learning algorithms to train base models using all features, then select the one that performs well and tune it for better accuracy.

Selecting Machine Learning Algorithms

This is a classification problem: we want to predict whether or not a customer will churn. Here are the classifiers that we will explore:

Splitting the Data in Training Set and Test Set

We are going to keep 70% of the data for training and 30% for testing. Based on our analysis above, about 73% of customers did not churn and about 27% did, which is somewhat imbalanced. We add the argument stratify=y to make sure that both the training and test datasets have the same class proportions as the original dataset.
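The stratified 70/30 split can be sketched as follows; a toy target with the same roughly 73/27 imbalance stands in for the real labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 73% non-churn, 27% churn, as in the text.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 73 + [1] * 27)

# stratify=y preserves the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
```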

Logistic Regression - Base Model

5.2. Synthetic Minority Oversampling Technique (SMOTE)

5.3. Recursive Feature Elimination

Recursive Feature Elimination (RFE) is based on the idea of repeatedly constructing a model, identifying the best or worst performing feature, setting that feature aside, and then repeating the process with the remaining features until all features are exhausted. The goal of RFE is to select features by recursively considering smaller and smaller sets of features.
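A minimal RFE sketch with scikit-learn, on synthetic data rather than the churn features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which only a few are informative.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

# RFE repeatedly fits the estimator and prunes the weakest feature
# until only n_features_to_select remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)

selected = np.where(selector.support_)[0]  # indices of the kept features
```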

5.4. Univariate Selection

5.5. Decision Tree Visualization

* Using Top Three Categorical Features

5.6. KNN Classifier

5.7. Visualising a Decision Tree from the Random Forest Classifier

5.8. Random Forest Classifier

5.9. Gaussian Naive Bayes

5.10. Support Vector Machine

5.11. Tuning Parameters for Support Vector Machine

5.12. LightGBM Classifier

5.13. XGBoost Classifier

5.14. AdaBoost Classifier

5.15. GradientBoosting Classifier

5.16. Bagging Classifier

5.17. CatBoost Classifier

6. Model Performances

6.1. Model Performance Metrics

For performance assessment of the chosen models, various metrics are used:

1. Feature weights: indicates the top features used by the model to generate the predictions.

2. Confusion matrix: shows a grid of true and false predictions compared to the actual values.

3. Accuracy score: shows the overall accuracy of the model for the training set and test set.

4. ROC curve: shows the diagnostic ability of a model by bringing together the true positive rate (TPR) and false positive rate (FPR) for different thresholds of class predictions (e.g. thresholds of 10%, 50% or 90% resulting in a prediction of churn).

5. AUC (for ROC): measures the overall separability between classes of the model, related to the ROC curve.

6. Precision-recall curve: shows the diagnostic ability by plotting precision against recall for different thresholds of class predictions. It is suitable for datasets with high class imbalance (negative values overrepresented), because precision and recall do not depend on the number of true negatives and thereby exclude the imbalance.

7. F1 score: the harmonic mean of precision and recall, measuring the compromise between both.

8. AUC (for PRC): measures the overall separability between classes of the model, related to the precision-recall curve.
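Several of these metrics can be computed in a few lines with scikit-learn; the labels and probabilities below are hypothetical stand-ins for a fitted model's output:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical true labels and churn probabilities from some fitted model.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.4, 0.2, 0.9, 0.1]

# Apply a 0.5 threshold to turn probabilities into class predictions.
y_pred = [int(p >= 0.5) for p in y_prob]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)     # uses probabilities, not labels
cm = confusion_matrix(y_true, y_pred)   # rows: actual, cols: predicted
```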

6.2. Compare Model Metrics

6.3. Confusion Matrices for Models

6.4. ROC - Curves for Models

6.5. Precision Recall Curves

7. Hyperparameter Tuning For Best Model

Model Optimization

In this section, we are going to try to improve the accuracy of our model. We will first focus on below techniques:

Cross-validation is a technique that consists of dividing the data into multiple folds (k) and, at each iteration, using k-1 folds for training and one fold for validation. This helps to avoid overfitting (where our model does not generalize properly to unseen data) and helps us choose the best model. The general term is k-fold cross-validation, where k is the number of folds the training data is split into.
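A minimal k-fold cross-validation sketch with scikit-learn, on synthetic data with a LogisticRegression standing in for the candidate model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the churn data.
X, y = make_classification(n_samples=300, random_state=0)

# 5-fold CV: each fold serves once as validation while the other
# four folds are used for training.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_acc = scores.mean()
```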

Hyperparameter tuning consists of feeding our model a range of parameter values and keeping the combination that yields the best accuracy.

Hyperparameter Tuning/Model Improvement

To address a potential bias stemming from the specific split of the data in the train-test-split step, cross-validation is used during hyperparameter tuning with grid search and randomized search. Cross-validation splits the training data into a specified number of folds; for each iteration, one fold is held out as a "training-dev" set and the other folds are used as the training set. The result of k-fold cross-validation is k values for each metric.
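A sketch of grid search with inner cross-validation, on synthetic data and a deliberately small hypothetical grid (the real search would cover far more combinations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the churn training data.
X, y = make_classification(n_samples=200, random_state=0)

# Hypothetical small grid; each combination is scored by inner CV.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation inside the search
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_  # winning combination
```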

7.1. Logistic Regression Hypertuning

Define a function that plots the ROC curve and the AUC score

Define a function that plots the precision-recall-curve and the F1 score and AUC score

Plot Model Evaluations

7.2. AdaBoost Classifier Hypertuning

Define a function that plots the ROC curve and the AUC score; run a grid search for the AdaBoost classifier and re-train with the optimized hyperparameters

7.3. Random Forest Hypertuning

Plot Model Evaluations

7.4. GradientBoost Hypertuning

Plot Model Evaluations

7.5. XGBoost Hypertuning

7.6. CatBoost Hypertuning

7.7. Bagging Classifier Hypertuning

7.8. LGBM Classifier Hypertuning

8. Model Performances Post Hypertuning

8.1. Model Performance Post Hypertuning Metrics

Define a function for the CatBoost algorithm that suppresses the per-iteration learn output (verbose=False) and include it in the main hyperparameter function

8.2. Compare Optimized Model Metrics

8.3. Confusion Matrix for Optimized Models

8.4. ROC - Curves for Optimized Models

8.5. Precision Recall Curves for Optimized Models

9. Conclusion, Inferences, Observations & Future Work

  1. We were able to achieve above 93% accuracy, 86% recall, 88% precision, and an F1 score of 0.87 with the Gradient Boosting algorithm. This equates to correctly identifying 86% of customer churn cases, with 88% of the flagged customers being genuine churners (i.e. only 12% of the targeted customers were loyal ones contacted unnecessarily). Assuming reasonable actions are taken, e.g. emailing the customer an offer, this model could be leveraged to improve customer satisfaction.
  2. The Gradient Boosting classifier performed the best post hypertuning, with accuracy → 93%, recall → 86%, F1 score → 87% and precision → 88%. As expected, SMOTE oversampling gave better results than no upsampling; we used SMOTE because the data was highly imbalanced, and we saw the model improve significantly.
  3. XGBoost is a modern, high-performing algorithm whose hypertuning can be scaled further with techniques like Hyperband, bayes-opt and Ray Tune (compared to GridSearchCV and RandomizedSearchCV), potentially beating Gradient Boosting in accuracy, recall and precision.
  4. The XGBoost classifier performed 2nd best post hypertuning (a large increase), with accuracy → 91%, recall → 87%, F1 score → 85% and precision → 83%. This equates to correctly identifying 87% of customer churn cases, with 83% of the flagged customers being genuine churners. Assuming reasonable actions are taken, e.g. emailing the customer an offer, this model could be leveraged to improve customer satisfaction. As expected, SMOTE oversampling gave better results than no upsampling; we used SMOTE because the data was highly imbalanced, and we also used the XGBoost parameter "scale_pos_weight" to handle class imbalance.
  5. XGBoost has a large number of hyperparameters to tune. Among the tuning approaches, we used GridSearchCV, which performs an exhaustive search and is slow in execution; scikit-learn's GridSearchCV and RandomizedSearchCV tend to be inferior and/or more time-consuming than Hyperband, Optuna and Ray Tune. XGBoost with Hyperopt, Optuna or Ray may provide better results than XGBoost with GridSearchCV. It can also be explored whether Bayesian optimization tunes faster and with less manual effort than sequential tuning.
  6. The Light Gradient Boosting classifier (LGBM) performed 3rd best, and significantly better post hypertuning, with accuracy → 89%, recall → 84%, F1 score → 80% and precision → 77%. As expected, SMOTE oversampling gave better results than no upsampling. This equates to correctly identifying 84% of customer churn cases, with 77% of the flagged customers being genuine churners.
  7. The CatBoost classifier performed 4th best post hypertuning, with accuracy → 83%, recall → 83%, F1 score → 72% and precision → 62%. SMOTE oversampling was not able to lift the results compared to the base model without upsampling. This equates to correctly identifying 83% of customer churn cases, with only 62% of the flagged customers being genuine churners.
  8. The AdaBoost classifier performed 5th best post hypertuning (a marginal increase), with accuracy → 81%, recall → 70%, F1 score → 65% and precision → 62%. This equates to correctly identifying 70% of customer churn cases, with 62% of the flagged customers being genuine churners. Assuming reasonable actions are taken, this model could still be leveraged, though it is clearly weaker than the boosted alternatives above.

9.1 Best Model Optimization Summary & Future Outlook

Model Summary

  1. We were able to achieve above 93% accuracy, 86% recall, 88% precision, and an F1 score of 0.87 with the Gradient Boosting algorithm. This equates to correctly identifying 86% of customer churn cases, with 88% of the flagged customers being genuine churners. Assuming reasonable actions are taken, e.g. emailing the customer an offer, this model could be leveraged to improve customer satisfaction. Given the high imbalance of the data towards non-churners, it makes sense to compare F1 scores (87%) and precision (88%) to pick the model with the best joint accuracy, precision and F1 score. This again points to the Gradient Boosting and XGBoost algorithms, with recall scores of 86% and 87% respectively. The XGBoost algorithm can be further optimized using Hyperband, BayesOpt, Optuna and Ray Tune to potentially become the best model.

  2. The XGBoost classifier performed 2nd best post hypertuning (a large increase), with accuracy → 91%, recall → 87%, F1 score → 85% and precision → 83%. This equates to correctly identifying 87% of customer churn cases, with 83% of the flagged customers being genuine churners. Looking at the model results, the best accuracy on the test set is achieved by the XGBoost classifier algorithm with 91